XML for the absolute beginner
A guided tour from HTML to processing XML with Java

An XML conceptual example
All
this talk of "inventing your own tags" is pretty foggy: What kind of tags
would a developer want to invent and how would the resulting XML be used?
In this section, we'll go over an example that compares and contrasts
information representation in HTML and XML. In a later section ("XSL: I
like your style") we'll go over XML display.
First, we'll take an example of a recipe, and display it as one
possible HTML document. Then, we'll redo the example in XML and discuss
what that buys us.
HTML example
Take a look at the little chunk of HTML in Listing 1:
<!-- The original html recipe -->
<HTML>
<HEAD>
  <TITLE>Lime Jello Marshmallow Cottage Cheese Surprise</TITLE>
</HEAD>
<BODY>
  <H3>Lime Jello Marshmallow Cottage Cheese Surprise</H3>
  My grandma's favorite (may she rest in peace).
  <H4>Ingredients</H4>
  <TABLE BORDER="1">
    <TR BGCOLOR="#308030"><TH>Qty</TH><TH>Units</TH><TH>Item</TH></TR>
    <TR><TD>1</TD><TD>box</TD><TD>lime gelatin</TD></TR>
    <TR><TD>500</TD><TD>g</TD><TD>multicolored tiny marshmallows</TD></TR>
    <TR><TD>500</TD><TD>ml</TD><TD>cottage cheese</TD></TR>
    <TR><TD></TD><TD>dash</TD><TD>Tabasco sauce (optional)</TD></TR>
  </TABLE>
  <P>
  <H4>Instructions</H4>
  <OL>
    <LI>Prepare lime gelatin according to package instructions...</LI>
    <!-- and so on -->
</BODY>
</HTML>
Listing 1. Some HTML
(A printable version of this listing can be found at example.html.)
Looking at the HTML code in Listing 1, it's probably clear to just
about anyone that this is a recipe for something (something awful, but a
recipe nonetheless). In a browser, our HTML produces something like this:
Lime Jello Marshmallow Cottage Cheese Surprise
My grandma's favorite (may she rest in peace).
Ingredients
| Qty | Units | Item |
| 1 | box | lime gelatin |
| 500 | g | multicolored tiny marshmallows |
| 500 | ml | Cottage cheese |
| | dash | Tabasco sauce (optional) |
Instructions
1. Prepare lime gelatin according to package instructions...
Listing 2. What the HTML in Listing 1 looks like in a browser
Now, there are a number of advantages to representing this recipe in
HTML, as follows:
- It's fairly readable. The markup may be a little cryptic, but if
it's laid out properly it's pretty easy to follow.
- The HTML can be displayed by just about any HTML browser, even one
without graphics capability. That's an important point: The display is
browser-independent. If there were a photo of the results of making this
recipe (and one certainly hopes there isn't), it would show up in a
graphical browser but not in a text browser.
- You could use a cascading style sheet (CSS -- we'll talk a bit about
those below) for general control over formatting.
There's one major problem with HTML as a data format, however. The
meaning of the various pieces of data in the document is lost.
It's really hard to take general HTML and figure out what the data in the
HTML mean. The fact that there's an <Ingredient> of
this recipe with a <Qty> (quantity) of 500 ml
(<Units>) of <Item> cottage cheese
would be very hard to extract from this document in a way that's generally
meaningful.
Now, the idea of data in an HTML document meaning something
may be a bit hard to grasp. Web pages are fine for the human reader, but
if a program is going to process a document, it requires unambiguous
definitions of what the tags mean. For instance, the
<TITLE> tag in an HTML document encloses the title of
the document. That's what the tag means, and it doesn't mean anything
else. Similarly, an HTML <TR> tag means "table row,"
but that's of little use if your program is trying to read recipes in
order to, say, create a shopping list. How could a program find a list of
ingredients from a Web page formatted in HTML?
Sure, you could write a program that grabs the headers out of the
document, reads the table column headers, figures out the quantities and
units of each ingredient, and so on. The problem is, everyone formats
recipes differently. What if you're trying to get this information from,
say, the Julia Child Web site, and she keeps messing around with the
formatting? If Julia changes the order of the columns or stops using
tables, she'll break your program! (Though it has to be said: If Julia
starts publishing recipes like this, she may want to think about changing
careers.)
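To make the fragility concrete, here is a minimal Java sketch of such a scraper. The class name TableScraper and its regular expression are made up for illustration, and the code assumes every ingredient row is laid out exactly as in Listing 1: three <TD> cells in the order Qty, Units, Item.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical scraper: it only understands rows shaped like
// <TR><TD>qty</TD><TD>units</TD><TD>item</TD></TR>.
class TableScraper {
    private static final Pattern ROW = Pattern.compile(
        "<TR>\\s*<TD>(.*?)</TD>\\s*<TD>(.*?)</TD>\\s*<TD>(.*?)</TD>\\s*</TR>",
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns one "qty units item" string per matching table row.
    static List<String> ingredients(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            String line = m.group(1) + " " + m.group(2) + " " + m.group(3);
            // Collapse line breaks and runs of spaces left over from the markup.
            found.add(line.trim().replaceAll("\\s+", " "));
        }
        return found;
    }

    public static void main(String[] args) {
        String asPublished =
            "<TR><TD>500</TD><TD>ml</TD><TD>cottage cheese</TD></TR>";
        // Same data after the author moves the Item column to the front.
        String reordered =
            "<TR><TD>cottage cheese</TD><TD>500</TD><TD>ml</TD></TR>";
        // Same data after the author stops using tables.
        String asList = "<UL><LI>500 ml cottage cheese</LI></UL>";

        System.out.println(ingredients(asPublished));
        System.out.println(ingredients(reordered)); // cells come back in the wrong roles
        System.out.println(ingredients(asList));    // nothing matches at all
    }
}
```

The scraper has no idea which cell is the quantity. It knows only positions, so any change in layout silently changes what the program believes the data mean.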
Now, imagine that this recipe page came from data in a database and
you'd like to be able to ship this data around. Maybe you'd like to add it
to your huge recipe database at home, where you can search and use it
however you like. Unfortunately, your input is HTML, so you'll need a
program that can read this HTML, figure out what all the "Ingredients,"
"Instructions," "Units," and so forth are, and then import them to your
database. That's a lot of work, especially since all of that semantic
information -- again, the meaning of the data -- existed in that original
database but was obscured in the process of being transformed into HTML.
Now, imagine you could invent your own custom language for describing
recipes. Instead of describing how the recipe was to be displayed, you'd
describe the information structure in the recipe: how each piece
of information would relate to the other pieces.
XML example
Let's just make up a markup language for describing recipes, and rewrite
our recipe in that language, as in Listing 3.
<?xml version="1.0"?>
<Recipe>
  <Name>Lime Jello Marshmallow Cottage Cheese Surprise</Name>
  <Description>
    My grandma's favorite (may she rest in peace).
  </Description>
  <Ingredients>
    <Ingredient>
      <Qty unit="box">1</Qty>
      <Item>lime gelatin</Item>
    </Ingredient>
    <Ingredient>
      <Qty unit="g">500</Qty>
      <Item>multicolored tiny marshmallows</Item>
    </Ingredient>
    <Ingredient>
      <Qty unit="ml">500</Qty>
      <Item>Cottage cheese</Item>
    </Ingredient>
    <Ingredient>
      <Qty unit="dash"/>
      <Item optional="1">Tabasco sauce</Item>
    </Ingredient>
  </Ingredients>
  <Instructions>
    <Step>
      Prepare lime gelatin according to package instructions
    </Step>
    <!-- And so on... -->
  </Instructions>
</Recipe>
Listing 3. A custom markup language for
recipes
It will come as little surprise to you, being the astute reader you
are, that this recipe in its new format is actually an XML document. Maybe
the fact that the file started with the odd header
<?xml version="1.0"?>
gave it away; in fact, every XML file should begin with this header.
We've simply invented markup tags that have a particular meaning; for
example, "An <Ingredient> is a <Qty>
(quantity in specified units) of a single <Item>, which
is possibly optional." Our XML document describes the
information in the recipe in terms of recipes, instead of in
terms of how to display the recipe (as in HTML). The semantics,
or meaning of the information, is maintained in XML because that's what
the tag set was designed to do.
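Here is a rough sketch of what that buys a program: the shopping-list problem from earlier, solved against the XML instead of the HTML. It uses the standard Java XML parsing classes (javax.xml.parsers and org.w3c.dom). The class name ShoppingList and the output format are assumptions made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class: builds a shopping list from recipe XML
// that uses the tag names invented in Listing 3.
class ShoppingList {

    // Returns one "qty unit item" line per <Ingredient> element.
    static List<String> fromRecipe(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse recipe", e);
        }

        List<String> lines = new ArrayList<>();
        NodeList ingredients = doc.getElementsByTagName("Ingredient");
        for (int i = 0; i < ingredients.getLength(); i++) {
            Element ingredient = (Element) ingredients.item(i);
            // Assumes each <Ingredient> holds one <Qty> and one <Item>.
            Element qty = (Element) ingredient.getElementsByTagName("Qty").item(0);
            Element item = (Element) ingredient.getElementsByTagName("Item").item(0);
            String line = qty.getTextContent().trim() + " "
                + qty.getAttribute("unit") + " "
                + item.getTextContent().trim();
            lines.add(line.trim().replaceAll("\\s+", " "));
        }
        return lines;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
            + "<Recipe><Ingredients>"
            + "<Ingredient><Qty unit=\"box\">1</Qty>"
            + "<Item>lime gelatin</Item></Ingredient>"
            + "<Ingredient><Qty unit=\"dash\"/>"
            + "<Item optional=\"1\">Tabasco sauce</Item></Ingredient>"
            + "</Ingredients></Recipe>";
        for (String line : fromRecipe(xml)) {
            System.out.println(line);
        }
    }
}
```

The program asks for ingredients, quantities, and items by name. It never has to guess from table columns or headings, so a change in how the recipe is displayed can't break it.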
Notes on notation
It's important to get some nomenclature straight. In Figure 1, you see a
start tag, which begins an enclosed area of text, known as an Item,
according to the tag name. As in HTML, XML tags may include a list of
attributes (each consisting of an attribute name and an
attribute value). The Item defined by the tag ends
with the end tag.
Figure 1. An XML start tag and its corresponding end tag
Not every tag encloses text. In HTML, the <BR> tag
means "line break" and contains no text. In XML, a start tag with no
matching end tag isn't allowed. Instead, XML has empty tags, denoted by a
slash before the final right-angle bracket in the tag. Figure 2 shows an
empty tag from our XML recipe. Note that empty tags may have attributes.
This empty tag example is standard XML shorthand for <Qty
unit="dash"></Qty>.
Figure 2. An empty tag
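If you'd like to convince yourself that the two spellings of an empty element are equivalent, a small Java experiment will do it. This is only a sketch: the class name EmptyTagDemo and the describe method are invented here, and the element is the empty <Qty> from Listing 3.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class: parses a single element and summarizes it.
class EmptyTagDemo {

    // Returns the element's name, its unit attribute, and its child count.
    static String describe(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse: " + xml, e);
        }
        return root.getTagName()
            + " unit=" + root.getAttribute("unit")
            + " children=" + root.getChildNodes().getLength();
    }

    public static void main(String[] args) {
        String shorthand = describe("<Qty unit=\"dash\"/>");
        String longhand = describe("<Qty unit=\"dash\"></Qty>");
        System.out.println(shorthand);
        System.out.println(longhand);
        System.out.println(shorthand.equals(longhand));
    }
}
```

Both forms give the parser the same thing: an element named Qty with one attribute and no content.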
In addition to these notational differences from HTML, the structural
rules of XML are more strict. Every XML document must be
well-formed. What does that mean? Read on!
Ooh-la-la! Well-formed XML
The concept of
well-formedness comes from mathematics: It's possible to write
mathematical expressions that don't mean anything. For example, the
expression
2 ( + + 5 (=) 9 > 7
looks (sort of) like math, but it isn't math because it doesn't follow
the notational and structural rules for a mathematical expression (not on
this planet, at least). In other words, the "expression" above isn't
well-formed. Mathematical expressions must be well-formed before
you can do anything useful with them, because expressions that aren't
well-formed are meaningless.
A well-formed XML document is simply one that follows all of the
notational and structural rules for XML. Programs that intend to process
XML should reject any input XML that doesn't follow the rules for being
well-formed. The most important of these rules are as follows:
- No unclosed tags
You can get away with all kinds
of wacko stuff in HTML. For example, in most HTML browsers, you can
"open" a list item with <LI> and never "close" it
with </LI>. The browser just figures out where the
</LI> would be and automatically inserts it for you.
XML doesn't allow this kind of sloppiness. Every start tag must have a
corresponding end tag. This is because part of the information in an XML
file has to do with how different elements of information relate to one
another, and if the structure is ambiguous, so is the information. So,
XML simply doesn't allow ambiguous structure. This nonambiguous
structure also allows XML documents to be processed as data structures
(trees), as I'll explain shortly in the discussion of the Document
Object Model.
- No overlapping tags
A tag that opens inside
another tag must close before the containing tag closes. For example,
the sequence
<Tomato> Let's call <Potato>the whole
thing off</Tomato> </Potato>
isn't well-formed because <Potato> opens inside of
<Tomato> but doesn't close inside of
<Tomato>. The correct sequence must be
<Tomato> Let's call <Potato>the whole
thing off</Potato> </Tomato>
In other words, the structure of the document must be strictly
hierarchical.
- Attribute values must be enclosed in quotes
Unlike HTML, XML doesn't allow "naked" attribute
values (i.e., HTML tags like <TABLE BORDER=1>, where
there are no quotes around the attribute value). Every attribute value
must have quotes (<TABLE BORDER="1">).
- The text characters (<) and (&) must always be represented by
'character entities'
To represent the left-angle bracket and the ampersand in the text part of
the XML (not in the markup), you must use the special character entities
(&lt;) and (&amp;), respectively. These characters are special characters
for XML: the first begins a tag and the second begins an entity. The
right-angle bracket (>) and the double quote (") have entities of their
own, (&gt;) and (&quot;). Using them is always safe, and a double quote
must be written as (&quot;) inside an attribute value that is itself
enclosed in double quotes. An XML file using, say, a bare left-angle
bracket in the text enclosed in tags isn't well-formed, and correctly
designed XML parsers will produce an error for such input.
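The rules above are easy to try out with the SAX parser that ships with Java. What follows is a minimal sketch, not a definitive tool. The class name WellFormedCheck and both method names are invented for this example. The escape helper handles the angle brackets and double quote discussed above, plus the ampersand, which XML also reserves because it begins every entity.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for this sketch; not part of any library.
class WellFormedCheck {

    // Returns true if a nonvalidating parser reads the input without error.
    static boolean isWellFormed(String xml) {
        SAXParser parser;
        try {
            parser = SAXParserFactory.newInstance().newSAXParser();
        } catch (Exception e) {
            throw new IllegalStateException("no SAX parser available", e);
        }
        try {
            // DefaultHandler ignores the content; we only care whether the
            // parser gets to the end of the document.
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Escapes text so it can sit between tags or inside a double-quoted
    // attribute value. The ampersand goes first so that the ampersands
    // added by the later replacements are not escaped a second time.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String[] samples = {
            "<OL><LI>Prepare lime gelatin</LI></OL>",          // properly closed
            "<OL><LI>Prepare lime gelatin</OL>",               // unclosed <LI>
            "<Tomato>Let's call <Potato>the whole thing off</Tomato></Potato>", // overlap
            "<TABLE BORDER=1></TABLE>",                        // naked attribute value
            "<Item>salt < pepper</Item>",                      // bare left-angle bracket
            "<Item>" + escape("salt < pepper") + "</Item>"     // same text, escaped
        };
        for (String xml : samples) {
            System.out.println(isWellFormed(xml) + "  " + xml);
        }
    }
}
```

Only the first and last samples pass. Every other sample breaks one of the rules listed above, and the parser refuses it.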
'Well-formed' means 'parsable'
A generic XML parser is a program or class that can read any well-formed
XML at its input. Many vendors now offer XML parsers in Java for free
(you'll find links to these packages in Resources at the bottom of this
article). XML parsers recognize well-formed
documents and produce error messages (much like a compiler would) when
they receive input that isn't well-formed. As we'll see, this
functionality is very handy for the programmer: You simply call the parser
you've selected and it takes care of the error detection and so on. While
all XML parsers check the well-formedness of documents (meaning, as we've
seen, that all the tags make sense, are nested properly, and so on),
validating XML parsers go one step further. Validating parsers
also confirm whether the document is valid; that is, that the
structure and number of tags make sense.
For example, most browsers will display a document that (nonsensically)
has two <TITLE> elements, but how can this be? Only one
title or no title makes sense.
For another example, imagine that in Listing 3 the "cottage cheese"
ingredient looked like this:
<Ingredient>
<Qty
unit="ml">500</Qty>
<Qty
unit="g">9</Qty>
<Item>Cottage
cheese</Item> </Ingredient>
This XML document is certainly well-formed, but it doesn't make sense.
It isn't structurally valid. It is nonsense for an
<Ingredient> to contain two <Qty> elements. What's the
<Qty> of this <Ingredient>?
The problem is, we have a document that's well-formed, but it isn't
very useful because the XML doesn't make sense. We need a way to specify
what makes an XML document valid. For example, how can we specify that a
<Qty> tag may contain only text (and not any other
elements) and report as errors any other case?
The answer to this question lies in something called the document
type definition, which we'll look at next.
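As a small preview, here is a hedged sketch of a validating parse in Java. The document type definition in it is invented for this example (it is an assumption about what sensible rules for an <Ingredient> might be, not the definition developed later in this article), and the class name ValidityCheck is hypothetical as well.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: validates one <Ingredient> against a
// made-up document type definition.
class ValidityCheck {

    // Assumed rules: an Ingredient is exactly one Qty followed by one
    // Item, and both of those may contain only text.
    static final String DTD =
        "<!DOCTYPE Ingredient [\n"
        + "  <!ELEMENT Ingredient (Qty, Item)>\n"
        + "  <!ELEMENT Qty (#PCDATA)>\n"
        + "  <!ATTLIST Qty unit CDATA #REQUIRED>\n"
        + "  <!ELEMENT Item (#PCDATA)>\n"
        + "  <!ATTLIST Item optional CDATA #IMPLIED>\n"
        + "]>\n";

    // Returns true if the ingredient is well-formed and obeys the DTD.
    static boolean isValid(String ingredientXml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // ask for a validating parser
            factory.newSAXParser().parse(
                new InputSource(new StringReader(DTD + ingredientXml)),
                new DefaultHandler() {
                    // Validity problems arrive as recoverable errors, which
                    // DefaultHandler ignores, so turn them into failures.
                    @Override
                    public void error(SAXParseException e) throws SAXException {
                        throw e;
                    }
                });
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, or not valid
        } catch (Exception e) {
            throw new IllegalStateException("parser setup or I/O failure", e);
        }
    }

    public static void main(String[] args) {
        String sensible = "<Ingredient><Qty unit=\"ml\">500</Qty>"
            + "<Item>Cottage cheese</Item></Ingredient>";
        String twoQuantities = "<Ingredient><Qty unit=\"ml\">500</Qty>"
            + "<Qty unit=\"g\">9</Qty><Item>Cottage cheese</Item></Ingredient>";
        System.out.println(isValid(sensible));
        System.out.println(isValid(twoQuantities));
    }
}
```

Both ingredients are well-formed, so a nonvalidating parser would accept either one. Only the validating parse, armed with rules about what an <Ingredient> may contain, can reject the second.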
Resources
There are so
many XML resources on the Web, I've had to categorize. The first section
here is the most useful, since the documents are either high-level
summaries or excellent link sites. Apologies to anyone who was omitted.
XML and Java: General XML resources
- "XML, Java and the Future of the Web," Jon Bosak. The paper that
started it all, at least from a Java programmer's point of view.
Definitely worth a read, even if it's a bit dated. Jon is commonly
considered to be the father of XML. Funny how all of these technologies
seem to have paternity:
http://metalab.unc.edu/pub/sun-info/standards/xml/why/xmlapps.html
- "Media-Independent Publishing: Four Myths about XML" Jon Bosak:
http://metalab.unc.edu/pub/sun-info/standards/xml/why/4myths.htm
- Robin Cover's XML-SGML site is, according to my SGML buddies, the
bible of XML resources:
http://www.oasis-open.org/cover/
- The W3C's XML resource page lets you cheer from the sidelines as XML
technology proposals develop into recommendations, or join in the fray
on their active mailing lists:
http://www.w3.org/XML/
- OASIS, the Web site of the Organization for the Advancement of
Structured Information Standards, offers general news and information
about XML:
http://www.oasis-open.org/
- The Graphics Communications Association, host of the XTech '99
conference (March 11 to 13, 1999, San Jose, CA) and the upcoming XML
Europe '99 conference in Granada, Spain, (April 26 to 30, 1999) has a
Web site packed with XML information:
http://www.gca.org/
- XML.com is great for watching trends and digging up XML news:
http://www.xml.com/
- Textuality hosts Tim Bray's site. Check it out for a look at the
"big picture" of how XML fits into the structured document universe --
and for a look at Lark, Tim's nonvalidating XML processor:
http://www.textuality.com/
- The XML FAQ:
http://www.ucc.ie/xml/
- IBM's XML Website is an outstanding supplement to alphaWorks:
http://www.software.ibm.com/xml/index.html
XML and Java
- "XML and Java: The Perfect Pair" by Ken Sall (Internet.com, November
1998) provides information about XML, Java, and why these two are a
match made in heaven:
http://wdvl.com/Authoring/Languages/XML/Java/index.html
Tutorials and training
- Generally Markup, Richard Lander's Web site, may be of interest to
you if you haven't yet read enough about markup languages:
http://pdbeam.uwaterloo.ca/~rlander/
- The Mulberry Technologies Web site is a good resource for commercial
training in XML, as well as general XML and SGML consulting by seasoned
SGML experts:
http://www.mulberrytech.com/
- The Web Developer's Virtual Library Series on XML offers good
summaries of various XML technologies, as well as annotated indices of
XML software:
http://wdvl.com/Software/XML
- Microsoft's Site Builder Network provides a series of articles
called "Extreme XML," one of which appears in the following link. While
some of it focuses on Microsoft-only, Windows-only technology, there's
still some great stuff here:
http://www.microsoft.com/sitebuilder/magazine/xml.asp
- Webmonkey has a good series of articles introducing readers to XML.
The index is at:
http://www.hotwired.com/webmonkey/xml/?tw=xml
- "What the ?xml!" by L.C. Rees offers an interesting take on XML and
why it's necessary -- nicely written and entertaining to boot:
http://www.geocities.com/SiliconValley/Peaks/5957/wxml.html
- "The XML Revolution" by Dan Connolly is a quick backgrounder on XML
(Nature):
http://helix.nature.com/webmatters/xml.html
Cascading Style Sheets
- W3C's CSS page will get you started learning about CSS:
http://www.w3.org/Style/CSS/
- "Cascading Style Sheets: Designing for the Web" by Håkon Wium Lie and
Bert Bos (Addison-Wesley, 1997). Sample chapters from the book appear at:
http://www.awl.com/cseng/titles/0-201-41998-X/liebos/
Extensible Style Language (XSL)
- The W3C's XSL page:
http://www.w3.org/Style/XSL/
- Read (and comment on) the W3C's XSL Working Draft (currently dated
December 16, 1998):
http://www.w3.org/TR/WD-xsl
- "The Extensible Style Language: Styling XML Documents"
(WebTechniques Magazine) XSL tutorial information and examples:
http://www.webtechniques.com/features/1999/01/walsh/walsh.shtml
- Microsoft's XML and XSL tutorial site is especially interesting
because of the recent release of client-side XSL in Internet Explorer
5.0. Extensive and excellent:
http://www.microsoft.com/xml
- If you're still using IE 4.0, you can still experiment with XML,
using Microsoft's internal DOM:
http://www.microsoft.com/xml/articles/xmlmodel.asp
- If you want to experiment with XSL, try downloading IBM's LotusXSL.
It's all Java, and for the time being, it's free:
http://www.alphaworks.ibm.com/tech/LotusXSL
- Or, you can try James Clark's XT XSL engine, downloadable from:
http://www.jclark.com/xml/xt.html
Upcoming XSL contest
Though the details aren't yet worked out, Sun Microsystems will soon
announce a call for proposals for a $30,000 grant to develop a
client-side processor for full XSL implementation in Mozilla.
It will also announce, in conjunction with Adobe, a contest (first prize
$40,000, second prize $20,000) to develop a pure-Java, server-side
processor of the entire XSL language, to format XML to PDF (Adobe's
document format). Keep watching the Java Developer Connection (requires
free registration), and Mozilla sites for the eventual announcements.
- "XTech '99: Java and the XML wave" by Mark Johnson
(JavaWorld, April 1999) offers the most current information on
the contest:
http://www.javaworld.com/javaworld/jw-04-1999/jw-04-xtech.html
Simple API for XML (SAX)
- The definitive description of SAX is available online. You can also
download free SAX software here:
http://www.megginson.com/SAX/index.html
Document Object Model (DOM)
- The W3C information page for the Document Object Model appears on
the W3C site:
http://www.w3c.org/DOM/
- Among other things, you'll find the W3C Recommendation for DOM Level
1:
http://www.w3.org/TR/REC-DOM-Level-1/
- The Java bindings for DOM, for both XML and HTML, are in this
Recommendation appendix:
http://www.w3.org/TR/REC-DOM-Level-1/java-language-binding.html
- A great DOM tutorial by William Robert Stanek appears on PC
Magazine Online in "Object-Based Web Design." This tutorial
includes a discussion of using DOM with IDL, CORBA's Interface
Definition Language:
http://www8.zdnet.com/pcmag/pctech/content/17/13/tf1713.001.html
Dynamic HTML
- The Dynamic HTML Resource page contains several links to DHTML
articles:
http://www.hotwired.com/webmonkey/dynamic_html/?tw=dynamic_html
Software
- Epicentric, Inc.:
http://www.epicentric.com/
- More XML (and other Java) technology than you can shake a stick at
is available at IBM's alphaWorks:
http://alphaworks.ibm.com/
- Version 2 of IBM's excellent XML parser package, xml4j, is available
for download. This package includes several parsers, both validating and
nonvalidating:
http://www.alphaworks.ibm.com/tech/xml4j
- See also IBM's exciting Bean Markup Language project, which uses XML
to represent and manipulate JavaBeans:
http://www.alphaworks.ibm.com/tech/bml
- Another free Java XML parser was written by the indefatigable James
Clark. Download it at:
http://www.jclark.com/xml/xp/index.html
- XEENA is IBM alphaWorks's DTD-guided XML editor. You want it, you
need it, you gotta have it:
http://www.alphaworks.ibm.com/tech/xeena
- Mozilla.org is the open source community's effort to extend the
Netscape source code. Find out about it at:
http://www.mozilla.org/
- Information about XML and CSS in Mozilla appears at:
http://www.mozilla.org/rdf/doc/xml.html
- You can read about Sun's XML and Java initiatives at:
http://www.sun.com/990310/java_xml.jhtml
- In addition, Java Project X includes source code downloadable from:
http://developer.java.sun.com/developer/earlyAccess/xml/index.html
- ArborText has a suite of sophisticated tools for editing SGML, XML,
and XSL:
http://www.arbortext.com/Products/products.html
- Oracle8i from Oracle corporation uses XML inside the Oracle core:
http://www.oracle.com/xml/
- Download Oracle's free XML for Java parser:
http://technet.oracle.com/direct/3xml.htm
- Microsoft's Internet Explorer 5.0, released this month, implements
part of the XSL spec. You can find it on Microsoft's Web site -- and
also just about anywhere else:
http://www.microsoft.com/windows/ie/default.htm
- You can also download a beta release of Microsoft's XML Notepad
editor (limited to running only on Microsoft Windows):
http://www.microsoft.com/xml/notepad/download.asp
- Vervet Logic of Bloomington, IN, has announced XML <PRO>, a
commercial XML editor:
http://www.vervet.com/
- Majix, to transform XML to HTML via XSL, is available at:
http://www.tetrasix.com/
- If your French is rusty, you might want to try the English-language
site at:
http://www.tetrasix.com/english/default.htm
History
- Read about the history of HTML here. It's part of an online book, so
there's no telling for how long it will be available:
http://ei.cs.vt.edu/~wwwbtb/hardcopy/book/chap4/origins.html
The two chapters listed below (of the book "HTML Unleashed" by Rick
Darnell, et al.) also cover some of the technical background of these
languages.
- SGML history
http://www.webreference.com/dlab/books/html/3-2.html
- XML history (such as it is):
http://www.webreference.com/dlab/books/html/38-0.html
- Nothing to do on Friday night? Why not read up on the history of
SGML? Charles Goldfarb, considered by many to be the "father of SGML,"
reminisces publicly at:
http://www.sgmlsource.com/Goldfarb/history/index.htm
- Useful XML and SGML information appears at Goldfarb's Web site,
including a comprehensive XML book list:
http://www.sgmlsource.com/
Miscellaneous links
- Uche Ogbuji has written an interesting article in
LinuxWorld about using XML on Linux in the Enterprise. It's at:
http://www.linuxworld.com/linuxworld/lw-1999-03/lw-03-xml.html
- Bluestone Software has recently made a splash with pure-Java XML
application servers, and a freely downloadable Swing package called
XwingML:
http://www.bluestone.com/
- Everyone (except Microsoft) is pretty freaked out about the US
Patent Office awarding Microsoft a patent for certain kinds of
functionality in style sheets. What happens with this patent, and its
impact on developing technology, remains to be seen. Judge for yourself
by reading the patent at:
http://www.patents.ibm.com/patlist?icnt=US&patent_number=5860073
- The title of the sample recipe is actually the title of a very funny
song by William Bolcom. Similar recipes may be found at:
http://www.b4uby.com/granny/gsoup.htm
- The song appears on a compact disc (with other odd songs) available
from the Public Radio Music Source at:
http://75music.org/best/docs/keepers.htm